Application No. 10/724,108

Reply to Office Action of November 15, 2006 

REMARKS/ARGUMENTS 

Claims 14-25 are active in this case. 

Support for the amendments to Claims 14 and 20, i.e., the genes/enzymes listed in the
claims, is found on pages 10-21 of the application, where they are specifically identified.

Claim 20 has been amended to clarify that the method transforms a cell in which one
or more of the genes identified have been deleted or inactivated. Accordingly, the rejection
under 35 U.S.C. § 112, second paragraph is no longer applicable.

The specification is amended to provide a cross-reference to related applications. 
No new matter is believed to have been added by the addition of these amendments. 

In the Official Action, the Examiner has maintained the written description rejection
(35 U.S.C. § 112, first paragraph), taking the position that the specification
does not provide adequate description for all of the possible genes encoding pyruvate
decarboxylase, aspartic protease, serine protease, aminopeptidase, and carboxypeptidase as
defined in the claims. This rejection is believed to be no longer applicable, as the claims have
been amended to define the genes listed in the Examples on pages 10-21, i.e., dipeptidyl
aminopeptidase (SPC14C4.15), cytoplasmic aminopeptidase (SPAC13A11.05), aspartic
protease (SPCC1795.09), pyruvate decarboxylase pdc1 (SPAC1F8.07), serine protease isp6
(SPAC4AF8.04), aminopeptidase (SPC4F10.02), carboxypeptidase (SPBC16G5.09),
carboxypeptidase (SPBC337.07c), vacuolar carboxylase S (SPAC24C9.08), zinc protease
(SPCUNK4.12C), zinc protease (SPCC1442.07c), metalloprotease (SPCC965.04c), zinc
metalloprotease (SPAC17A5.04c), CAAX prenyl protease I (SPC3G1.05), dipeptidyl
peptidase (SPBC1711.12), dipeptidase (SPCC965.12), methionine metallopeptidase
(SPBC14C8.03), methionine aminopeptidase (SPBC3E7.10), signal peptidase (SPAC1071.04c), and
mitochondrial peptidase β subunit (SPBP23A10.15c).

Moreover, the references for the genes in the Examples ("SPC," "SPAC" etc), relate 
to open reading frames from the genome sequence of Schizosaccharomyces pombe reported 
in the journal Nature 415 (6874), 871-880 (2002), copy attached (see also page 12, lines 8-9). 

Withdrawal of this rejection is requested. 

The Examiner has maintained that the YAP3 protease described in Egel-Matani is
an "aspartic protease" and therefore meets the definition of the claims, notwithstanding the
discussion in col. 2, lines 18-19, in which the enzyme is characterized as cleaving arginine.
However, while Egel-Matani describes a S. cerevisiae YAP3, there is no
disclosure of S. pombe YAP3-type proteases, and certainly not of the specific aspartic protease
SPCC1795.09 described in the specification on page 16, line 1 and listed in the claims.
Accordingly, withdrawal of this rejection is requested.

Regarding the rejection based on Simeon, the publication does appear to describe a
CPY serine protease and, notwithstanding our view of the data Simeon presents, the Examiner
has taken the position that the S. pombe strain inherently over-expresses the S. cerevisiae
CPY introduced therein. However, what Simeon does not describe is the specific serine
protease isp6 (SPAC1F8.07), which has a distinct structure (i.e., sequence). In this regard,
Applicants attach the sequences of the Schizosaccharomyces pombe cpy1 gene for
carboxypeptidase Y (PubMed accession No. D86560) used in Simeon and of the serine protease
isp6 referenced on page 10, line 16 and listed in the pending claims (i.e., SPAC4A.04).

The Examiner has also rejected Claims 14, 18, 20 and 24 as being obvious over
WO 00/42203 in view of Giga-Hama et al. The rejection is based on the allegation that
one would have applied the techniques described in WO 00/42203 to the S. pombe cells in
Giga-Hama. This rejection is no longer applicable in light of the amended claims submitted
herein, particularly because these two publications do not describe or suggest the
specific genes/enzymes defined in Claims 14 and 20.



As for the required substitute Application Data Sheet, this was to correct Mr. Tohda's 
name and another mark-up is attached. 

A Notice of Allowance for all pending claims is earnestly solicited.

Should the Examiner deem that any further action is necessary to place this 
application in even better form for allowance, he is encouraged to contact Applicants' 
undersigned representative. 



Withdrawal of the rejection is requested. 



Respectfully submitted.

LOCUS       NM_001019245            1404 bp    mRNA    linear   PLN 20-JUN-2005
DEFINITION  Schizosaccharomyces pombe 972h- hypothetical protein (SPAC4A8.04),
            partial mRNA.
ACCESSION   NM_001019245
VERSION     NM_001019245.1  GI:67999866
SOURCE      Schizosaccharomyces pombe 972h-
  ORGANISM  Schizosaccharomyces pombe 972h-
            Eukaryota; Fungi; Ascomycota; Schizosaccharomycetes;
            Schizosaccharomycetales; Schizosaccharomycetaceae;
            Schizosaccharomyces.
REFERENCE   1
  AUTHORS   Wood,V., Gwilliam,R., Rajandream,M.A., Lyne,M., Lyne,R.,
            Stewart,A., Sgouros,J., Peat,N., Hayles,J., Baker,S., Basham,D.,
            Bowman,S., Brooks,K., Brown,D., Brown,S., Chillingworth,T.,
            Churcher,C., Collins,M., Connor,R., Cronin,A., Davis,P.,
            Feltwell,T., Fraser,A., Gentles,S., Goble,A., Hamlin,N., Harris,D.,
            Hidalgo,J., Hodgson,G., Holroyd,S., Hornsby,T., Howarth,S.,
            Huckle,E.J., Hunt,S., Jagels,K., James,K., Jones,L., Jones,M.,
            Leather,S., McDonald,S., McLean,J., Mooney,P., Moule,S.,
            Mungall,K., Murphy,L., Niblett,D., Odell,C., Oliver,K., O'Neil,S.,
            Pearson,D., Quail,M.A., Rabbinowitsch,E., Rutherford,K., Rutter,S.,
            Saunders,D., Seeger,K., Sharp,S., Skelton,J., Simmonds,M.,
            Squares,R., Squares,S., Stevens,K., Taylor,K., Taylor,R.G.,
            Tivey,A., Walsh,S., Warren,T., Whitehead,S., Woodward,J.,
            Volckaert,G., Aert,R., Robben,J., Grymonprez,B., Weltjens,I.,
            Vanstreels,E., Rieger,M., Schafer,M., Muller-Auer,S., Gabel,C.,
            Fuchs,M., Dusterhoft,A., Fritzc,C., Holzer,E., Moestl,D.,
            Hilbert,H., Borzym,K., Langer,I., Beck,A., Lehrach,H.,
            Reinhardt,R., Pohl,T.M., Eger,P., Zimmermann,W., Wedler,H.,
            Wambutt,R., Purnelle,B., Goffeau,A., Cadieu,E., Dreano,S.,
            Gloux,S., Lelaure,V., Mottier,S., Galibert,F., Aves,S.J., Xiang,Z.,
            Hunt,C., Moore,K., Hurst,S.M., Lucas,M., Rochet,M., Gaillardin,C.,
            Tallada,V.A., Garzon,A., Thode,G., Daga,R.R., Cruzado,L.,
            Jimenez,J., Sanchez,M., del Rey,F., Benito,J., Dominguez,A.,
            Revuelta,J.L., Moreno,S., Armstrong,J., Forsburg,S.L., Cerutti,L.,
            Lowe,T., McCombie,W.R., Paulsen,I., Potashkin,J., Shpakovski,G.V.,
            Ussery,D., Barrell,B.G., Nurse,P. and Cerrutti,L.
  TITLE     The genome sequence of Schizosaccharomyces pombe
  JOURNAL   Nature 415 (6874), 871-880 (2002)
   PUBMED   11859360
REFERENCE   2  (bases 1 to 1404)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (03-JUN-2005) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. This record is derived from an annotated genomic
            sequence (NC_003424).
            COMPLETENESS: not full length.

FEATURES             Location/Qualifiers
     source          1..1404
                     /organism="Schizosaccharomyces pombe 972h-"
                     /mol_type="mRNA"
                     /strain="972h-"
                     /db_xref="taxon:284812"
                     /chromosome="I"
     gene            1..1404
                     /gene="isp6"
                     /locus_tag="SPAC4A8.04"
                     /note="synonym: prb1"
                     /db_xref="GeneID:2543097"
     CDS             1..1404
                     /gene="isp6"
                     /locus_tag="SPAC4A8.04"
                     /note="involved in sexual differentiation; subtilase-type
                     protease; involved in RNA degradation via regulation of
                     the RNA degrading activity of Pnu1p (PMID 11872168);
                     similar to S. cerevisiae PRB1"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="NP_593815.1"
                     /db_xref="GI:19114727"
                     /db_xref="GeneID:2543097"
                     /translation="MRIPYSNLFSAAAGLALFASTACAAPVMPATDSDIAHAGIRPEL
                     DNAFYDSHGEAATPKHKPHAGPNAAPLLSASNADTTGLDSHYIIVLQPDLSEQEFQAH
                     TNWVSEMHQMDIASQEDEYYDTSDSNYMFGLKHVYDFGEDSFKGYSGQFSSNIVEQIR
                     LHPHVIAVERDQVVSIKKLETQSGAPWGLARISHKSVKYDDIGKYVYDSSAGDNITAY
                     VVDTGVSIHHVEFEGRASWGATIPSGDVDEDNNGHGTHVAGTIASRAYGVAKKAEIVA
                     VKVLRSSGSGTMADVIAGVEWTVRHHKSSGKKTSVGNMSLGGGNSFVLDMAVDSAVTN
                     GVIYAVAAGNEYDDACYSSPAASKKAITVGASTINDQMAYFSNYGSCVDIFAPGLNIL
                     STWIGSNTSTNTISGTSMATPHVAGLSAYYLGLHPAASASEVKDAIIKMGIHDVLLSI
                     PVGSSTINLLAFNGAQE"



ORIGIN
1 atgagaattc cttattcaaa tcttttttct gccgccgcag gtttggccct cttcgcttct
61 actgcctgtg ctgcaccagt gatgccagcg actgattcgg acattgccca tgctggtatc
121 cgtcctgagc tcgataacgc tttctacgac tcccacggcg aagccgctac ccctaagcac
181 aaacctcatg ctggtcctaa tgccgctcct ctcttgtctg cttccaacgc cgataccact
241 ggactggact ctcactatat cattgttttg cagcctgatt tgagcgaaca agaattccaa
301 gcccatacta attgggtctc ggagatgcac caaatggaca ttgcttctca agaagatgag
361 tactacgata ctagtgatag taattacatg tttggcttga agcatgttta tgactttggc
421 gaggactctt tcaaaggtta ttctggtcaa ttcagctcta acattgttga gcaaattcgc
481 ttgcatcctc atgtcatcgc cgtcgagcgt gatcaagttg tcagcattaa gaaacttgaa
541 actcaaagtg gcgctccttg gggacttgct cgcatctctc acaaatccgt caaatacgat
601 gatattggca aatatgttta cgattccagc gctggtgaca acatcaccgc ttatgttgta
661 gataccggtg taagcattca tcatgttgag ttcgaaggtc gcgcttcttg gggtgcaacc
721 attccctctg gtgatgttga tgaggataac aatggtcatg gtacgcatgt tgctggtacc
781 attgctagcc gtgcttatgg tgttgcaaag aaggctgaaa tcgttgctgt caaggttctc
841 cgttccagtg gatctggtac catggctgat gtgattgccg gtgttgagtg gactgttcgt
901 catcacaaat cgtccggcaa gaaaacctct gttggcaaca tgtctcttgg tggcggcaac
961 agctttgttt tggatatggc tgttgattct gccgttacca acggtgttat ttatgccgtt
1021 gctgctggaa atgagtatga tgacgcttgc tattcatctc ctgctgcttc taagaaagcc
1081 atcaccgttg gcgcttccac tataaacgac caaatggcct acttctctaa ctatggtagc
1141 tgtgttgaca tcttcgctcc tggacttaac attttgtcta catggatcgg ttcaaatact
1201 agtactaaca ccatctctgg tacttctatg gcaacccctc atgttgcagg tttgtccgct
1261 tactaccttg gcctacaccc tgctgccagc gctagcgaag ttaaagatgc tatcattaag
1321 atgggtattc acgatgtact cttgtctatt cctgttggta gcagcactat taaccttctc
1381 gctttcaatg gtgctcaaga atag
//

LOCUS       D86560                  4308 bp    DNA     linear   PLN 14-APR-1998
DEFINITION  Schizosaccharomyces pombe cpy1 gene for carboxypeptidase Y,
            complete cds.
ACCESSION   D86560
VERSION     D86560.1  GI:3046860
KEYWORDS    cpy1; carboxypeptidase Y.
SOURCE      Schizosaccharomyces pombe (fission yeast)
  ORGANISM  Schizosaccharomyces pombe
            Eukaryota; Fungi; Ascomycota; Schizosaccharomycetes;
            Schizosaccharomycetales; Schizosaccharomycetaceae;
            Schizosaccharomyces.
REFERENCE   1
  AUTHORS   Tabuchi,M., Iwaihara,O., Ohtani,Y., Ohuchi,N., Sakurai,J.,
            Morita,T., Iwahara,S. and Takegawa,K.
  TITLE     Vacuolar protein sorting in fission yeast: cloning, biosynthesis,
            transport, and processing of carboxypeptidase Y from
            Schizosaccharomyces pombe
  JOURNAL   J. Bacteriol. 179 (13), 4179-4189 (1997)
   PUBMED   9209031
REFERENCE   2
  AUTHORS   Takegawa,K., Tabuchi,M., Iwaihara,O., Morita,T. and Iwahara,S.
  TITLE     Cloning and characterization of cpy1 gene from Schizosaccharomyces
            pombe
  JOURNAL   Unpublished
REFERENCE   3  (bases 1 to 4308)
  AUTHORS   Takegawa,K.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-JUL-1996) Kaoru Takegawa, Kagawa University,
            Department of Bioresource Science, Faculty of Agriculture; Ikenobe
            2393, Kita-gun, Miki-cho, Kagawa 761-07, Japan
            (E-mail:takegawa@ag.kagawa-u.ac.jp, Tel:087-891-3116,
            Fax:087-898-7295)
COMMENT     On Apr 15, 1998 this sequence version replaced gi:2274843.
            Sequence updated (08-Apr-1998).

FEATURES             Location/Qualifiers
     source          1..4308
                     /organism="Schizosaccharomyces pombe"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4896"
     gene            812..3820
                     /gene="cpy1"
     CDS             812..3820
                     /gene="cpy1"
                     /codon_start=1
                     /product="carboxypeptidase Y"
                     /protein_id="BAA25568.1"
                     /db_xref="GI:3046861"
                     /translation="MLMKQTFLYFLLTCVVSAQFNGYVPPEQNGGDIVVPKDFYEKFG
                     EDFIREQEESSAPLMNPVPERDEAEAPHHPKGHHEFNDDFEDDTALEHPGFKDKLDSF
                     LQPARDFLHTVSDRLDNIFDDDEDEHVREKRPHDSADEDAPRRKHGKCKGKGKHHKGK
                     HAKGKGKKSHPKPEDDSVFFDDERPKHHEFDDEDREFPAHHEPGEHMPPPPMHHKPGE
                     HMPPPPMHHEPGEHMPPPPMHHEPGEHMPPPPMHHEPGEHMPPPPMHHEPGEHMPPPP
                     MHHEPGEHMPPPPMHHEPGEHMPPPPMHHEPGEHMPPPPMHHEPGEHMPPPPMHHEPG
                     EHMPPPPFKHHELEEHEGPEHHRGPEDKEHHKGPKDKEHHKGPKDKEHHKGPKDKEHH
                     KGPKDKEHHKGPKDKEHHKGPKDKEHHQGPKEKHNERPEQNMQSSHELLVIEAFADLI
                     NSVPVEEIAEEFSRFLDTLGIEYYGNIPVHIQENAPKDSSIPPLFEFDDDLELSDLTP
                     EQFAYLEMLKAEGIDPMTAFRDQSHPAKPSNAQPADSSRPYAVFSQEENGEHVNLKAF
                     PDHTLRVKDSKPESLGIDTVKQYTGYLDVEDDRHLFFWFFESRNDPENDPVVLWLNGG
                     PGCSSLTGLFMELGPSSINIETLKPEYNPHSWNSNASVIFLDQPINTGFSNGDDSVLD
                     TVTAGKDVYAFLNLFFAKFPQYAHLDFHIAGESYAGHYIPQFAKEIMEHNQGANFFVA
                     SGYEMEKQYINLKSVLIGNGLTDPLVQYYFYGKMACESPYGPIMSQEECDRITGAYDT
                     CAKLITGCYQTGFTPVCIGASLYCNNAMIGPFTKTGLNIYDIREECRDQEHLCYPETG
                     AIESYLNQEFVQEALGVEYDYKGCNTEVNIGFLFKGDWMRKTFRDDVTAILEAGLPVL
                     IYAGDADYICNYMGNEAWTDALEWAGQREFYEAELKPWSPNGKEAGRGKSFKNFGYLR
                     LYEAGHMVPFNQPEASLEMLNSWIDGSLFA"

ORIGIN 

1 agtacttgat gactgcaaaa caattaataa caagagcgtg tttttatatt ttatactgtc 
61 tgtttaaggt ttaattgaat ccaataattc aactaagaat aacttgatcc acatgtttac 
121 ctttcttttt gtttaataaa tattatcaga tgtgattttc ctgtttgcta actgcaagcg 
181 aagctacttt tttcattata tttcgttatt agcgtataaa gacattgtaa agtatttgca 
241 gcgatatcta aggaggattg cgaaggggct tagaacatga tgattcatgc gaggcgtctg 
301 cgtcacgctt ctggacttat gcttaaccct tcatatatat agcgattgtt gcctgccgct 
361 tacttttctc cttaaccaag cacacattaa cgccatcgaa ccctagaacc agtaaagtga 
421 tctgaagagg gaaaagcagc aaaacgatcg taaatacgtt tgtttaagaa tttgctactt 
481 tttgtggttt tctgtctagt tatttattgc gatatccaca ttgcctttag tgcttgtaat 
541 cacactgtaa gacgtgaggc ttcgcaatcg acattgagtt aaatacattc tttttctcat 
601 gtttattttt ctttttaatc tatccatata gctcattttg cctaacagtt atctgatttg 
661 gctaatttgc aatttctttt ttaatccagt acatacagat ttgtatctat caattgattg 
721 tttgagaagt tcggtctatt tggttagctt gttgtacgaa tatttcattt ccttttgttt 
781 ctctcctttc tatactgtac attgaaacaa gatgttaatg aaacaaacct tcttgtactt 
841 tttgctcact tgcgtcgtat ccgctcagtt taacggttat gttcctcctg agcaaaatgg 
901 tggtgatatc gtcgttccca aagattttta cgagaaattt ggggaggatt tcatccgtga 
961 gcaagaggaa agctctgctc ctcttatgaa tcccgttccc gagcgtgatg aagctgaggc 
1021 tcctcatcac cccaagggtc atcacgagtt taatgatgac tttgaagatg atactgcctt 
1081 agaacaccct ggatttaaag acaagcttga ctctttcctt cagcctgctc gtgatttcct 
1141 tcacaccgtt tctgatcgtc tcgacaatat ctttgacgat gatgaggatg aacacgttcg 
1201 cgagaagcgc cctcatgact cagctgatga ggatgctccc cgcagaaagc acggtaaatg 
1261 caaaggaaaa ggaaagcacc ataagggtaa acatgctaag ggaaagggaa agaagtctca 
1321 ccctaaaccc gaggatgact ctgttttctt tgatgacgag cgtcccaagc atcatgaatt 
1381 tgacgatgag gatcgagagt tccctgctca tcacgagcct ggtgagcaca tgcctcctcc 
1441 tcctatgcac cacaagcccg gtgagcacat gcctcctcct cctatgcacc acgaacctgg 
1501 agagcacatg cctcctcctc ctatgcacca cgaacctgga gagcacatgc ctcctcctcc 
1561 tatgcaccac gaacctggag agcacatgcc tcctcctcct atgcaccacg aacctggaga 
1621 gcacatgcct cctcctccta tgcaccacga acctggagag cacatgcctc ctcctcctat 
1681 gcaccacgaa cctggagagc acatgcctcc tcctcctatg caccacgaac ctggagagca 
1741 catgcctcct cctcctatgc atcacgagcc tggagagcac atgcctcctc ctcctatgca 
1801 tcacgagcct ggagagcaca tgcctcctcc tcctttcaaa caccatgagc ttgaggagca 
1861 tgaaggtcct gagcatcatc gtggacctga ggacaaggaa caccataagg gacccaagga 
1921 taaggagcac cataagggac ctaaggataa ggagcaccat aagggaccca aggataagga 
1981 gcaccataag ggacctaagg ataaggagca ccacaaggga cctaaggaca aggagcacca 
2041 caagggacct aaggacaagg agcaccatca aggacctaag gagaagcaca acgaacgtcc 
2101 tgagcaaaac atgcaaagtt ctcatgagct tttggtcatt gaggctttcg ctgacctcat 
2161 caattctgtt cctgttgaag aaattgccga agagttttct cgctttttag acactcttgg 
2221 tattgaatac tatggtaaca ttcctgtaca tattcaagaa aatgccccaa aggattcatc 
2281 aattcccccc ctatttgaat tcgacgatga tttggagttg agtgatctca ctcctgaaca 
2341 atttgcctac cttgaaatgc tcaaggctga aggtattgat cctatgactg ctttccgtga 
2401 ccagagtcat cctgctaagc catctaatgc tcaacctgct gattcttcac gtccctacgc 
2461 tgttttttca caagaagaga atggtgaaca tgtaaattta aaggctttcc ctgatcacac 
2521 tcttcgcgtt aaagattcca aacctgaatc acttggtatt gacactgtta agcaatacac 



2581 cggttactta gatgtcgaag atgacagaca tcttttcttc tggttctttg aatctagaaa
2641 tgatcccgag aatgatcccg tcgtgttgtg gttgaacggt ggtcctggtt gctcttccct
2701 tactggtttg ttcatggaat taggtccttc ttcaatcaac attgagactc ttaaacccga
2761 atataaccct cacagttgga actccaatgc ttcagttatc tttttggatc aacctatcaa
2821 cacgggtttc agcaacggag atgactcggt tcttgacact gttacggctg gtaaggatgt
2881 ttatgcattc ttgaaccttt tctttgccaa gttccctcag tacgctcatt tggactttca
2941 cattgctggt gaatcctatg ctggccatta catcccccag tttgccaagg aaattatgga
3001 gcataaccaa ggtgctaact tctttgttgc cagcggctat gaaatggaga agcaatacat
3061 caatttgaag agtgtcttga ttggaaatgg tttgactgat cctttggtcc aatactactt
3121 ttacggaaaa atggcttgcg agagccctta cggtcctatt atgtcccaag aggaatgtga
3181 tcgcattact ggtgcctatg atacctgcgc taagctaatc actggctgtt accagactgg
3241 ctttactcct gtttgcattg gtgcctcttt gtattgcaat aacgctatga ttggaccatt
3301 tactaagact ggactcaaca tttatgatat tcgtgaagaa tgccgtgacc aagagcactt
3361 atgctacccc gaaaccggtg caattgagag ttacttgaac caagaatttg ttcaagaagc
3421 tttgggagtt gaatacgatt acaagggatg caatactgaa gtaaacattg gtttcctttt
3481 caagggtgac tggatgcgta agactttccg tgacgatgtc accgcaatct tagaagctgg
3541 cctacccgtt cttatctatg ccggtgatgc tgactacatt tgcaattaca tgggcaatga
3601 agcttggacc gacgcacttg agtgggctgg tcaacgtgag ttttatgagg ccgaattgaa
3661 gccttggagt cctaatggaa aggaagctgg tcgtggtaag tctttcaaaa actttggtta
3721 tcttcgcctc tacgaagctg gtcacatggt tcctttcaac caacccgaag ctagtttaga
3781 aatgttgaac agctggatag atggttctct ttttgcttaa agtgtcatca gttggacatt
3841 ttacatctac tgaatagttg atatgaaccg aatatgactg actattggct aatatacagt
3901 agccataatt aatctttgcg ttttattaat taaattgatt ttaatatttg aaagatataa
3961 tgatgaatta cataaatatt tgtttttgta gtagtagtaa gtaacgctgt tggaaaactt
4021 ttcactgtaa gtatttcttc aattctcgcg gtttgtgtcg taatagtaac aacatagtta
4081 taattcccga ccaatattac tacgccgata attactgcat acaaaaattt ttggtttaag
4141 atgtaaacag tcgattgtaa ttgaaataga ttgtccataa ctaagtgctt cgtaataaaa
4201 attatcttga tcacataaaa gtattactca tccatgtgct tcataaatgt acgcttgtga
4261 tgttgaacac ctaaggtacc aagttcgtca atgtcagcct tgtctaga
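
The two attached records can be compared mechanically. The following is a minimal Python sketch, not part of the filing: it translates the first 60 bases of each coding sequence quoted above (bases 1-60 of NM_001019245 and bases 812-871 of D86560) using the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative assumptions, not drawn from any cited tool.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
# Illustrative helper only; the names here are assumptions for this sketch.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    seq = "".join(dna.split()).upper()  # drop the spaces between 10-base blocks
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Bases 1-60 of the isp6 record (NM_001019245), copied from its ORIGIN block.
isp6_start = "atgagaattc cttattcaaa tcttttttct gccgccgcag gtttggccct cttcgcttct"
# Bases 812-871 of the cpy1 record (D86560), i.e. the start of its CDS.
cpy1_start = "atgttaatg aaacaaacct tcttgtactt tttgctcact tgcgtcgtat ccgctcagtt t"

print(translate(isp6_start))  # MRIPYSNLFSAAAGLALFAS
print(translate(cpy1_start))  # MLMKQTFLYFLLTCVVSAQF
```

Both results agree with the opening residues of the /translation qualifiers in the respective records, and the two peptides differ from the third residue onward.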

We have sequenced and annotated the genome of fission yeast (Schizosaccharomyces pombe), which contains the smallest
number of protein-coding genes yet recorded for a eukaryote: 4,824. The centromeres are between 35 and 110 kilobases (kb) and
contain related repeats including a highly conserved 1.8-kb element. Regions upstream of genes are longer than in budding yeast
(Saccharomyces cerevisiae), possibly reflecting more-extended control regions. Some 43% of the genes contain introns, of which
there are 4,730. Fifty genes have significant similarity with human disease genes; half of these are cancer related. We identify
highly conserved genes important for eukaryotic cell organization including those required for the cytoskeleton,
compartmentation, cell-cycle control, proteolysis, protein phosphorylation and RNA splicing. These genes may have originated
with the appearance of eukaryotic life. Few similarly conserved genes that are important for multicellular organization were
identified, suggesting that the transition from prokaryotes to eukaryotes required more new genes than did the transition from
unicellular to multicellular organization.



We report here the completion of the fully annotated genome
sequence of the simple eukaryote Schizosaccharomyces pombe, a
fission yeast. It becomes the sixth eukaryotic genome to be
sequenced, following Saccharomyces cerevisiae, Caenorhabditis
elegans, Drosophila melanogaster, Arabidopsis thaliana and
Homo sapiens. The entire sequence of the unique regions of the
three chromosomes is complete, with gaps in the centromeric
regions of about 40 kb, and about 260 kb in the telomeric regions.
The completion of this sequence, the availability of sophisticated
research methodologies, and the expanding community working on
S. pombe will accelerate the use of S. pombe for functional and
comparative studies of eukaryotic cell processes.

Schizosaccharomyces pombe is a single-celled free living
archiascomycete fungus sharing many features with cells of more
complicated eukaryotes. From gene sequence comparisons and
phylogenetic analyses, it has been suggested that fission yeast
diverged from budding yeast around 330-420 million years (Myr)
ago, and from Metazoa and plants around 1,000-1,200 Myr ago,
although a more recent estimate has put these times at 1,144 and
1,600 Myr, respectively. Some gene sequences are as equally
diverged between the two yeasts as they are from their human
homologues, probably reflecting a more rapid evolution within
fungal lineages than in the Metazoa. S. pombe was first described in
the 1890s and has been extensively studied since the 1950s,
resulting in the characterization of around 1,200 genes
(http://www.genedb.org/pombe). The ease with which it can be genetically
manipulated is second only to S. cerevisiae among eukaryotes and it
has served as an excellent model organism for the study of cell-cycle
control, mitosis and meiosis, DNA repair and recombination,
and the checkpoint controls important for genome stability.

The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge
CB10 1SA, UK; Cancer Research UK London Research Institute, Computational Genome Analysis
Laboratory, 44 Lincoln's Inn Fields, London WC2A 3PX, UK; Cancer Research UK London Research
Institute, Cell Cycle Laboratory, 44 Lincoln's Inn Fields, London WC2A 3PX, UK; Katholieke
Universiteit Leuven, Faculty of Agricultural and Applied Biological Sciences, Laboratory of Gene
Technology, Kardinaal Mercierlaan 92 Blok F, B-3001 Leuven, Belgium; Genotype GmbH, Molecular
Biology and Biotech Research, Angelhofweg 39, D-69259 Wilhelmsfeld, Germany; QIAGEN GmbH,
Max-Volmer-Str. 4, D-40724 Hilden, Germany; Max-Planck-Institut für molekulare Genetik, Ihnestrasse 73,
D-14195 Berlin, Germany; GATC Biotech AG, Jakob-Stadler-Platz 7, D-78467 Konstanz, Germany;
AGOWA GmbH, Glienicker Weg 185, D-12489 Berlin, Germany; Université de Louvain, Unité de
Biochimie Physiologique, Place Croix du Sud 2-20, B-1348 Louvain-la-Neuve, Belgium; UMR 6061
CNRS Génétique et développement, Faculté de Médecine, 2 avenue du Professeur Léon Bernard, F-35043
Rennes Cedex, France; University of Exeter, School of Biological Sciences, Washington Singer
Laboratories, Perry Road, Exeter EX4 4QG, UK; Génétique Moléculaire et Cellulaire, CNRS URA1925 INRA

The 13.8-Mb genome of S. pombe is distributed between
chromosomes I (5.7 Mb), II (4.6 Mb) and III (3.5 Mb), together with a
20-kb mitochondrial genome. Tandem arrays of 100-120 repeats
of a 10.4-kb fragment containing the 5.8S, 18S and 25S ribosomal
RNA genes account for around 1.1 Mb. The three centromeres are
35, 65 and 110 kb long for chromosomes I, II and III, respectively,
totalling 0.2 Mb. This leaves about 12.5 Mb of unique sequence,
similar in size to that of S. cerevisiae, and substantially smaller than
those of the three other sequenced model eukaryotes, C. elegans
(97 Mb), Arabidopsis (125 Mb) and Drosophila (137 Mb). All of the
unique sequence and most of the three centromeres of the Urs
Leupold 972h- strain have been sequenced by the Wellcome Trust
Sanger Institute and the 13 other laboratories that make up the
S. pombe European Sequencing Consortium (EUPOM), together
with 100 kb of sequence generated by the Cold Spring Harbor
Laboratory (GenBank accession numbers AL355920, AL355921,
AL391034 and AL391016). Here, we present and discuss the
genome sequence and composition, and carry out an initial
overview of gene function, making comparisons with other eukaryotic
organisms, particularly S. cerevisiae.
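
The size accounting in the paragraph above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the variable names are assumptions, and the figures are the rounded values quoted in the text.

```python
# Rounded sizes in megabases (Mb), as quoted in the text above.
chromosomes_mb = {"I": 5.7, "II": 4.6, "III": 3.5}
rdna_arrays_mb = 1.1          # tandem ribosomal RNA gene arrays
centromeres_kb = [35, 65, 110]  # chromosomes I, II and III

total_mb = round(sum(chromosomes_mb.values()), 1)
centromeres_mb = round(sum(centromeres_kb) / 1000, 1)
unique_mb = round(total_mb - rdna_arrays_mb - centromeres_mb, 1)

print(total_mb)        # 13.8
print(centromeres_mb)  # 0.2
print(unique_mb)       # 12.5
```

The 20-kb mitochondrial genome is reported separately and is not part of this sum.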

Mapping, sequencing and sequence analysis 

A clone map was generated by the integration of the two
pre-existing maps. End sequencing and restriction digestion of
cosmids were used to construct a minimal tile path for sequencing.
Problems with the earlier maps included the existence of
chimaeric clones, mismapped cosmids, bacterial insertion elements
and unfilled gaps. Small gaps were covered using a long-range
polymerase chain reaction (PCR) strategy, plasmid libraries, and a
bacterial artificial chromosome (BAC) library provided clones
for gap closure across regions not represented in the cosmid
libraries. The final 12.5-Mb sequence of the S. pombe genome is a
composite of 452 cosmids, 22 plasmids, 15 BAC clones and 13
PCR products.

Most sequencing was performed using random sequencing of
sub-cloned DNA followed by directed sequencing. DNA from
clones was shattered (usually by sonication) and fragments of
1.4-2 kb were cloned, typically, into M13 or pUC18. Random
sub-clones were sequenced with dye-terminator chemistry and analysed
on automated sequencers. Most laboratories used Phred software
for sequence base calling and Phrap or Gap4 for contig assembly.
Gaps and low-quality regions of the sequence were resolved using
primer walking, PCR and re-sequencing clones, under conditions
that gave increased read lengths. Some laboratories also used direct
blotting procedures, classical radioactive sequencing and nested
deletions. All sequences were finished to a high degree of accuracy,
with at least two high-quality reads on each strand, or, if this could
not be accomplished, an additional read on the same strand using an
alternative chemistry. The depth of coverage was on average
eight-fold. Sequences were collected centrally at the Wellcome Trust
Sanger Institute, where the quality was examined by comparison
of overlapping regions and by checking for frameshifts in coding
regions. The sequencing error rate was less than 1 in 180,000 base
pairs (bp), calculated from the number of single-base differences
observed in overlapping sequences from different sources. All
identified sequencing errors have been resolved with the exception
of four single-base differences found in homopolymeric tracts
located outside coding regions, possibly generated by slippage
during DNA replication.

Gene prediction was carried out with GENEFINDER (P. Green
and L. Hillier, unpublished software) trained on experimentally
confirmed S. pombe genes to recognize intronic and coding regions.
Additional information was provided using a Hidden Markov
Model trained on intron sequences using HMMER
(http://hmmer.wustl.edu/hmmer-html/). Searches were performed against
public databases (SWISS-PROT and TrEMBL, EMBL and
Pfam), using BLAST, MSPcrunch, FASTA and Genewise.
The predictions were refined manually within the Artemis analysis
and annotation tool using protein homology and expressed
sequence tag (EST) data. Because most S. pombe genes have a
prospective homologue in other organisms, putative functions were
assigned on the basis of similarities to known genes, using the
SWISS-PROT, Pfam, Proteome, SGD and MIPS databases.
Identification of transfer RNA was carried out using the
tRNAscan-SE software.

Prediction of genes in fission yeast is a problem of intermediate
complexity. It is more difficult than the analysis of tightly packed
genomes that have little or no splicing, as found in prokaryotes and
budding yeast, but less difficult than gene prediction in
multicellular eukaryotes, which have lower gene density, high levels of
splicing, and long introns. There are 4,730 confirmed and predicted
introns in S. pombe, many more than the 272 now predicted for
S. cerevisiae. S. pombe introns average only 81 nucleotides in length
and so are shorter and easier to predict than those found in Metazoa
and plants. Of the 4,730 introns in S. pombe, 638 have been
confirmed experimentally by messenger RNA and EST data, and
many more by homology.

Genome content 

We predicted a maximum of 4,940 protein coding genes (including
11 mitochondrial genes) and 33 pseudogenes. The three gene maps
showing these predictions can be viewed at
ftp://ftp.sanger.ac.uk/pub/yeast/pombe/GeneMaps/. All open reading frames (ORFs) over
100 amino acids with an initiator methionine and not overlapping
with other known genes are included in this set. Also included are
147 confirmed or predicted protein-coding sequences of 25-99
amino acids. Any remaining undiscovered genes are likely to have
either a highly spliced structure with small exons, or to be smaller
than 100 amino acids. There are a further 116 questionable proteins
considered less likely to be coding because they are small, have no
detectable homologies, and display low coding potential. Removal
of these questionable genes reduces the predicted gene complement
from 4,940 to 4,824.

Even our upper estimate of 4,940 genes for S. pombe is
substantially less than the 5,570-5,651 genes predicted for S. cerevisiae,
the 6,752 genes predicted for Mesorhizobium loti, the largest
published prokaryote genome sequence to date, and the 7,825
genes estimated in the 8.67-Mb genome of the prokaryote
Streptomyces coelicolor (J. Parkhill and S. Bentley, personal
communication). We conclude that a free-living eukaryotic cell can be
constructed with fewer than 5,000 genes, and that the distinction
between eukaryotic and prokaryotic cell organization is not
determined simply by total number of genes but depends on the types of
genes present and how they interact with each other and the
environment. Comparing the genome content of species at different
levels of organization, it seems that fewer than 500 genes are
sufficient to generate a parasitic prokaryotic cell such as
Mycoplasma genitalium, about 1,500 genes for a free-living
prokaryotic cell such as Aquifex aeolicus, 5,000 genes for a free-living
eukaryotic cell (S. cerevisiae and S. pombe; ref. 39 and this paper),
and around 15,000 genes for multicellular eukaryotic organisms
such as Drosophila and C. elegans, whereas 30,000-40,000 genes
gives rise to human consciousness.

Gene density is similar for chromosomes I and II, with one gene
every 2,483 and 2,457 bp respectively, but is less dense for
chromosome III, at one gene every 2,790 bp. This is not due to differences in
the average length of the genes, which are similar (1,407-1,446 bp)
for all three chromosomes (Table 1). Protein-coding genes are
absent from the centromeres, although tRNA genes are found in
these regions. Gene density is also lower at the telomeres. The gene
density for the complete genome is one gene every 2,528 bp,
compared with one gene every 2,088 bp for S. cerevisiae. The
protein-coding sequence is predicted to occupy 60.2% (57%
excluding introns) of the sequenced portion of the S. pombe
genome, compared with 71% in S. cerevisiae (70.5% excluding
introns). The overall guanine and cytosine (GC) content is 36.0%,
compared with 38.3% in S. cerevisiae, and for the protein-coding
portion is identical in the two yeasts at 39.6%.
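
The gene-density figures quoted above follow from dividing each sequenced length by its gene count, both taken from Table 1. The short sketch below reproduces them; it is a worked check under that assumption, and the names `table1` and `density_bp_per_gene` are illustrative.

```python
# (sequenced length in bp, number of genes), as listed in Table 1.
table1 = {
    "Chromosome 1": (5_598_923, 2_255),
    "Chromosome 2": (4_397_795, 1_790),
    "Chromosome 3": (2_465_919, 884),
    "Whole genome": (12_462_637, 4_929),
}

# Gene density expressed as average bp per gene, rounded to the nearest bp.
density_bp_per_gene = {
    name: round(length / genes) for name, (length, genes) in table1.items()
}

for name, density in density_bp_per_gene.items():
    print(f"{name}: one gene every {density:,} bp")
```

The per-chromosome lengths and gene counts also sum to the whole-genome row (12,462,637 bp and 4,929 genes).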

We have identified a total of 1 74 tRNAs, 45 of which have introns; 
all the tRNA families needed to decode all codons are present. The 
spliceosomal RNAs (U1-U6) are found together with 16 small 
nudear RNA genes (snRNAs) and 33 small nucleolar RNAs (sno- 
RNAs). These are dispersed mostly as singletons throughout the 
genome. The 5.8S. 18S and 26S ribosomal RNA genes are grouped 
Table 1 Genome content for the three chromosomes 

              Length      No. of  No. of  No. of      No. of  No. of     No. of       Mean gene     Gene      Coding
              (bp)        genes   Tf2s    pseudoTf2s  wtfs    lone LTRs  pseudogenes  length (bp)*  density†  (%)
Chromosome 1  5,598,923   2,255   8       0           1       77         17           1,446         2,483     58.6
Chromosome 2  4,397,795   1,790   2       1           1       53         9            1,411         2,457     57.5
Chromosome 3  2,465,919   884     1       2           23      50         7            1,407         2,790     54.5
Whole genome  12,462,637  4,929   11      3           25      180        33           1,426         2,528     57.5

* Mean gene length excluding introns. 
† Gene density, given as average bp per gene. 



together as 100-120 tandem repeats in two arrays on chromosome III, 
but the thirty 5S ribosomal RNA genes are distributed 
throughout the genome, providing opportunities for unequal 
crossing over when they are in tandem orientation and close 
proximity. This can lead to local duplications and deletions of 
genes located between the 5S RNA genes. There are 11 intact 
transposable elements (Tf2 type) (Table 1), accounting for 0.35% of 
the genome. This is significantly less than the 2.4% (59 elements) 
found in S. cerevisiae and the 10% found in Arabidopsis, and is 
also likely to be much less than the numbers in Drosophila and 
humans. There are 25 wtf elements (with tf1- or tf2-type long 
terminal repeats, LTRs), which appear to be spliced membrane 
proteins of S. pombe. These elements are often flanked by LTRs, and 
so may have been duplicated by retrotransposition. There are also 
180 solo LTRs, marking former transposition events, compared with 
268 found in S. cerevisiae. The density of transposable element 
remnants on chromosome III of S. pombe is twice that of chromosomes 
I and II (Table 1). 

We examined 73 genetically and physically mapped genes from 
the three gene maps; comparison of these maps shows that they are 
essentially co-linear and that the level of recombination is similar 
throughout the three chromosomes. More detailed comparisons of 
the genetic and physical maps may reveal subtle variations in 
recombination around centromeres, telomeres, the mating-type 
locus, and sites of meiotic DNA double-strand breaks. Several 
inconsistencies in the genetic maps were identified, including the 
reversal of a chromosome II fragment near the telomere between 
trp1 and spo4 (ref. 46), the relocation of cut1 and wee1 from the 
telomere region to the centromere region of chromosome III, and 
changes in position of lys1 and top1. 

Centromere structures 

The outline structure of the centromeres has previously been 
deduced by Southern blotting and by sequencing about 14% of 
the centromere repeat regions. Here, we sequenced most (81%) 
of the three centromeres; this has allowed schematic maps of the 
centromeres to be verified (Fig. 1). The nomenclature used follows 
that of the Yanagida group; however, other designations of the 
centromere elements have been used. The most complete sequence 
is for centromere 1, which is the shortest at 35 kb and is missing only 
one 2.5-kb fragment. This centromere consists of a central core 
(cnt1) of 4.1 kb and 28% GC content, flanked by two 5.6-kb 
imperfect imr1 repeats (imr1L, imr1R) with 29% GC content, 
and two pairs of 4.4-kb dg and 4.8-kb dh repeats (dg1, dh1) of 
33-34% GC content. A repeat of around 0.3 kb, known as cen253 
(EMBL X13757), is found adjacent to the dh repeats. The maps of 
the other two centromeres have the same basic structure with 
central cnt regions flanked by imr repeats and by variable numbers 
of dg and dh repeats separated by cen253. cnt1, -2 and -3 share 48% 
identity over a 1,405-bp region, and dh1, -2 and -3 share 48% 
identity over a 1,811-bp region. However, the most striking 
conservation is observed in the dg regions, which share 97% identity 
over a 1,780-bp region. This highly conserved segment represents an 
element that is essential for centromere function; deletion of this 
region from the dg repeat, termed the K/K'' repeat by the Clarke 
group, results in a complete loss of centromere activity in both 
mitosis and meiosis. There must be a special mechanism to 
maintain such a high level of sequence conservation between the 
different centromeres. The total calculated lengths of centromeres 1, 
2 and 3 are respectively 35, 65 and 110 kb, inversely related to 
the lengths of the chromosomes at 5.7, 4.6 and 3.5 Mb. Possibly 
more extended centromeric regions are required for proper mitotic 
and meiotic behaviour when the chromosome arms are shorter. As 
noted above there are no protein-coding genes in the centromeric 
region but there are many tRNA genes (Fig. 1). tRNA clusters flank 
centromeres 2 and 3 and are also found within the imr regions of all 
three centromeres. These tRNA genes might contribute to 
centromere function by defining domain boundaries important for 
centromere activity. 

Figure 1 Schematic maps of the three S. pombe centromeres showing the repeated elements. The key is given at the bottom of the figure and the relevant clones are indicated under 
each centromere map. The maps are not drawn to scale. 

The S. pombe centromeres are considerably longer than their 
S. cerevisiae equivalents, which contain a core region sufficient for 
centromere activity of only 120 bp and a nuclease-protected 
region of 150-160 bp including the 120-bp conserved core. It is 
not clear why S. pombe centromeres are 300-1,000 times larger than 
their S. cerevisiae equivalents, but one possibility is that their 
kinetochore structures are different. 
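
The "300-1,000 times larger" figure follows from the centromere lengths given above. As a rough check, the sketch below divides each S. pombe centromere length by the 120-bp S. cerevisiae core; treating the 120-bp core (not the 150-160 bp protected region) as the reference is an assumption made here for illustration.

```python
# Centromere lengths (kb) for S. pombe, as stated in the text.
CENTROMERE_KB = {"cen1": 35, "cen2": 65, "cen3": 110}

# S. cerevisiae core region sufficient for centromere activity (bp).
S_CEREVISIAE_CORE_BP = 120


def fold_larger(centromere_kb: float, core_bp: int = S_CEREVISIAE_CORE_BP) -> float:
    """How many times longer an S. pombe centromere is than the core."""
    return centromere_kb * 1000 / core_bp


if __name__ == "__main__":
    for name, kb in CENTROMERE_KB.items():
        print(f"{name}: about {fold_larger(kb):.0f}-fold larger")
```

The ratios come out at roughly 290-fold to 920-fold, in line with the rounded range quoted in the text.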

Intergene regions 

The total intergene length distributions for S. pombe and 
S. cerevisiae are shown in Fig. 2. The length is calculated from the 
stop codon to the next start codon for tandemly oriented genes, 
from the start codon to the start codon for divergently oriented 
genes, and from the stop codon to the stop codon for convergently 
oriented genes. Intergenic regions in S. pombe have a mode of 423 bp 
and a mean of 952 bp, both longer than the equivalent values for 
S. cerevisiae (200 and 515 bp respectively). Analysis of the divergent 
intergene regions reveals that pairs of upstream regions range in 
length from 200 to 2,100 bp, with a peak between 200 and 1,200 bp 
(Fig. 2). This is longer than the equivalent distributions in 
S. cerevisiae, which range from 200 to 900 bp, with a peak from 200 
to 700 bp (Fig. 2). Analysis of convergent intergene regions shows a 
peak in length for pairs of downstream regions of 200-800 bp for 
S. pombe and 100-500 bp for S. cerevisiae (Fig. 2). Therefore there is 
a smaller difference between the two yeasts for the intergenic regions 
between convergent genes (downstream regions) than for those 
between the divergent genes (upstream regions). 
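
The three cases used to measure intergene length (tandem, divergent, convergent) can be sketched as follows. This is an illustrative reconstruction, not the authors' pipeline: the `Gene` record, the 1-based inclusive coordinates and the example gene names are all assumptions. A gene on the "+" strand has its start codon at its left end; a gene on the "-" strand has its start codon at its right end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coding coordinate (1-based, inclusive)
    end: int     # rightmost coding coordinate (1-based, inclusive)
    strand: str  # "+" or "-"


def classify_pair(left: Gene, right: Gene) -> str:
    """Orientation of two neighbouring genes, left to right on the chromosome."""
    if left.strand == right.strand:
        return "tandem"        # stop codon -> next start codon
    if left.strand == "-":
        return "divergent"     # start codon <-> start codon
    return "convergent"        # stop codon <-> stop codon


def intergene_length(left: Gene, right: Gene) -> int:
    """Bases strictly between the two coding regions (negative if they overlap)."""
    return right.start - left.end - 1


def intergene_regions(genes):
    """Classify and measure every gap between coordinate-sorted neighbours."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [
        (a.name, b.name, classify_pair(a, b), intergene_length(a, b))
        for a, b in zip(ordered, ordered[1:])
    ]


# Hypothetical genes, for illustration only.
EXAMPLE = [
    Gene("geneA", 100, 400, "+"),
    Gene("geneB", 901, 1500, "-"),
    Gene("geneC", 2001, 2600, "+"),
    Gene("geneD", 3000, 3500, "+"),
]
```

Because the gap is always measured between the facing ends of the two coding regions, the orientation only determines which codons (start or stop) bound the gap, which is why a single length formula serves all three cases.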

Several explanations can account for these results. The 5' mRNA 
regions may be systematically longer in S. pombe than in 
S. cerevisiae, although there is no evidence for this. For example, the 
spacing between the TATA-box region and the transcriptional start 
in S. pombe is shorter than that in S. cerevisiae. Alternatively, the 
promoter regions may be of greater complexity in S. pombe and 
therefore longer. Again there is no direct evidence to support this 
view, but there are other examples of more-extended organization 
of chromatin elements in S. pombe, including larger centromeres 
and regions of DNA replication origin. The existence of truly 
intergenic spacer regions in S. pombe is supported by the 
identification of several 4-8-kb extended gene-free regions, which fall outside 
the broad distribution of lengths associated with average intergenic 
regions. These are low complexity sequences with a (G - C)/(G + C) 
strand switch. There are about ten gene-free regions per 
chromosome, which are usually flanked by tandemly oriented genes. One of 
these gene-free regions, between SPAC4G8.03c and SPAC4G8.04, 
corresponds to a prominent meiotic DNA break site or cluster of 
sites (I. A. Young, R. W. Schreckhise and G. R. Smith, manuscript in 
preparation). 

Introns 

A total of 4,730 introns is distributed among 43% of S. pombe genes, 
with 15 being the largest number of introns found within a single 
gene (Table 2). Introns varied from 29 to 819 nucleotides long, with 
a mean length of 81 and a mode of 48 nucleotides. In S. cerevisiae, 
introns are much rarer, with only 5% of genes having introns. Most 
introns in S. pombe follow the rule of GT donor and AG acceptor, 
but there are three examples that have GC donors. The average 
positions of introns within genes were assessed by mapping them 
with respect to the start and stop codons. This analysis does not take 
into account any introns in 5' and 3' untranslated regions. For the 
genes with 1-6 introns there is a 5' bias from the values expected if 
introns were evenly distributed throughout the genes (Table 2). A 5' 
bias is also seen in S. cerevisiae, where it has been hypothesized to be 
due to in vivo reverse transcription generating complementary 
DNAs primed from the 3' ends of the mRNAs, followed by 
replacement of the original chromosomal gene with the cDNA by 
homologous recombination. Because cDNAs are extended from 
their 3' ends, there will be a tendency for introns at 5' ends not to be 



[Figure 2 histograms. Panels: Total, Divergent orientation and Convergent orientation, for S. pombe and S. cerevisiae. x-axis: Length of intergene region (bp).] 

Figure 2 Intergene regions. Distribution of intergene regions given for all genes and for 
divergent and convergent pairs of genes, for both S. pombe and S. cerevisiae. A total of 
4,890 intergene regions from S. pombe were analysed from a database prepared just 
before completion of the whole genome, and 5,788 intergene regions from S. cerevisiae 
were analysed. Histograms show the number of regions in 200-bp bins. 
Table 2 Introns per gene and average positions of introns within genes 

Introns   No. of  Mean gene    Position of introns*
per gene  genes   length (bp)  1            2            3            4            5            6
0         2,683   1,497
1         996     1,426        0.26 (0.50)
2         614     1,396        0.17 (0.33)  0.48 (0.66)
3         324     1,588        0.13 (0.25)  0.37 (0.50)  0.63 (0.75)
4         148     1,633        0.10 (0.20)  0.27 (0.40)  0.50 (0.60)  0.73 (0.80)
5         70      1,603        0.08 (0.17)  0.22 (0.33)  0.37 (0.49)  0.56 (0.66)  0.77 (0.83)
6         40      2,162        0.06 (0.14)  0.22 (0.28)  0.34 (0.42)  0.49 (0.57)  0.66 (0.71)  0.82 (0.85)
7-15      34      2,766

The data set of 4,677 introns was prepared just before completi... 
* The mean position of introns, with the values in brackets represent... 



removed from the chromosomal genes. Of genes that have two or 
more introns, 614 have two introns, 324 have three, 148 have four, 
70 have five and 40 have six (Table 2). Thus the number of genes 
having an extra intron decreases by about half as intron number 
increases from two to six per gene. These observations may be of 
relevance to speculations concerning the mechanisms by which 
introns are generated and removed. The relatively large number of 
introns in S. pombe provides opportunities for alternative splicing 
to generate protein variants, which could have regulatory roles as 
well as increasing the range of protein types present in the cell. 
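
Two of the quantitative statements about introns can be reproduced from the Table 2 counts. The sketch below assumes that the "expected" position of the k-th of n evenly spaced introns is k/(n+1) of the gene length, which is consistent with the bracketed values in Table 2 but is not stated explicitly in the text; all names are illustrative.

```python
from fractions import Fraction

# Number of genes with a given intron count, from Table 2.
GENES_BY_INTRON_COUNT = {2: 614, 3: 324, 4: 148, 5: 70, 6: 40}


def expected_positions(n_introns: int) -> list:
    """Relative positions k/(n+1) of n introns spread evenly along a gene."""
    return [Fraction(k, n_introns + 1) for k in range(1, n_introns + 1)]


def successive_ratios(counts: dict) -> dict:
    """Ratio of gene counts for each intron number to the previous one."""
    keys = sorted(counts)
    return {k2: counts[k2] / counts[k1] for k1, k2 in zip(keys, keys[1:])}


if __name__ == "__main__":
    print([float(p) for p in expected_positions(3)])
    for n, ratio in successive_ratios(GENES_BY_INTRON_COUNT).items():
        print(f"{n} introns vs {n - 1}: ratio {ratio:.2f}")
```

The successive ratios all lie close to one half, matching the "decreases by about half" observation, and comparing `expected_positions(n)` with the observed means in Table 2 shows the shift of observed positions toward the 5' end.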

Genome duplications and comparisons 

Comparisons of chromosomal sequences and searches for tracts of 
conserved gene order did not reveal evidence for large-scale genome 
duplications in S. pombe. This differs from reports for S. cerevisiae 
and Arabidopsis, which have suggested that both of these organisms 
have undergone some large-scale genome duplication. However, 
blocks of duplicated sequence totalling about 50 kb retaining a 
conserved gene order can be found at the sub-telomeric regions of 


[Figure 3 pie charts. a, S. pombe: S.p. proteins in S.c. and C.e.; S.p. proteins in S.c.; S.p. proteins in C.e. b, S. cerevisiae: S.c. proteins in S.p. and C.e.; S.c. proteins in S.p.; S.c. proteins in C.e.] 

Figure 3 Comparison of proteins in S. pombe (S.p.), S. cerevisiae (S.c.) and C. elegans 
(C.e.). a, Pie chart comparing the homology of proteins of S. pombe with those of 
S. cerevisiae and C. elegans. b, Pie chart comparing the homology of proteins of 
S. cerevisiae with those of S. pombe and C. elegans. For example, S.p. proteins in S.c. and 
C.e. means S. pombe proteins with homologues found in S. cerevisiae and C. elegans. 
The absolute numbers of proteins are given for both yeasts. 



chromosomes I and II. Twenty-four genes (in groups of two or four) 
are 100% identical at the DNA level, and twenty of these are 
localized in sub-telomeric regions, suggesting frequent exchange 
of genetic information at these positions. Most of these genes code 
for proteins belonging to families specific to fission yeast and are 
predicted to be cell-surface proteins. Interestingly, in S. cerevisiae 7 
of the 16 genes (in groups of two, three or four) that are 100% 
identical at the DNA level are also located in sub-telomeric regions. 
These gene products include members of the budding-yeast-specific 
PAU and COS families, which are also predicted to be cell-surface 
proteins. In the highly plastic telomeric and sub-telomeric regions 
of malaria and several other protozoan parasites, genes coding for 
species-specific cell-surface proteins are also found, for example, the 
Var, Rifin and Stevor families of Plasmodium falciparum. These data 
suggest that recombination events between telomeric regions may be 
a major mechanism involved in the generation of organism-specific 
cell-surface molecules. These molecules may also be of importance 
for cell identity and for processes that generate hypervariable 
cell-surface molecules relevant for self and non-self recognition. 

We next compared the proteins of S. pombe with those of the 
unicellular eukaryote S. cerevisiae and the metazoan C. elegans 
(Fig. 3), using BlastP with a cutoff E-value of 0.001 and no 
low-complexity filtering. Excluding genes coded by the mitochondria 
and transposons, we used a data set of 4,876 proteins from S. pombe, 
5,777 proteins from S. cerevisiae (Cerpep 14 May 2001; 
ftp://ftp.sanger.ac.uk/pub/yeast/SCreannotation/cerpep) and 19,622 proteins 
from C. elegans (ftp://ftp.sanger.ac.uk/pub/databases/wormpep). 
About two-thirds of the S. pombe proteins (3,281) have homologues 
in common with both S. cerevisiae and C. elegans (Fig. 3). A smaller 
number, 769 (16%), have homologues in S. cerevisiae but not in 
C. elegans and many fewer, 145 (3%), have homologues in C. elegans 
but not in S. cerevisiae. A total of 681 proteins (14%) seems to be 
unique to S. pombe. A comparison between S. cerevisiae and the 
other two organisms gave similar results, with 3,605 (62%) of the 
proteins in common, 918 (16%) found only in S. pombe and 150 
(3%) only in C. elegans, leaving 1,104 proteins (19%) unique to 
S. cerevisiae. Thus, S. cerevisiae proteins with homologues only in 
S. pombe total 918 whereas the reverse comparison totals 769 (Fig. 3), 
indicating that there might be more gene duplications in S. cerevisiae, 
accounting for the extra proteins found in this organism. 
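
The percentages in this comparison follow directly from the protein counts. A minimal sketch of the arithmetic, using only the counts quoted above (the dictionary keys and function name are illustrative):

```python
# Counts of proteins by where homologues are found, as quoted in the text.
S_POMBE = {
    "S. cerevisiae and C. elegans": 3281,
    "S. cerevisiae only": 769,
    "C. elegans only": 145,
    "unique": 681,
}
S_CEREVISIAE = {
    "S. pombe and C. elegans": 3605,
    "S. pombe only": 918,
    "C. elegans only": 150,
    "unique": 1104,
}


def shares(counts: dict) -> dict:
    """Each class as a whole-number percentage of the full protein set."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


if __name__ == "__main__":
    for organism, counts in (("S. pombe", S_POMBE), ("S. cerevisiae", S_CEREVISIAE)):
        print(organism, sum(counts.values()), shares(counts))
```

The four classes for each yeast sum to the data-set sizes given in the text (4,876 and 5,777 proteins), which confirms that the classes are mutually exclusive.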

To investigate gene duplication further, we carried out an 'all 
against all' comparison using the same protein data sets and NCBI 
BlastClust (ftp://ncbi.nlm.nih.gov/blast/documents/README.bcl) 
to distinguish protein clusters from proteins represented 
uniquely. Of the 4,876 protein-coding genes of S. pombe, 4,515 
have no other sequence relatives within the organism and can be 
considered unique. The remaining 361 are distributed among 
protein cluster groups with two or more members (Table 3). 
Using the same parameters in S. cerevisiae, 5,061 genes are unique 
and 716 fall into groups with two or more members (Table 3). 
This supports the idea that there is less gene redundancy in S. pombe than in 
S. cerevisiae, which may help functional analyses of those genes that 
are not duplicated in S. pombe. 
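
The idea of separating clustered proteins from singletons can be illustrated with single-linkage clustering, the approach BlastClust is documented to use. The sketch below is not BlastClust itself: it assumes the pairwise "linked" relation (pairs passing the score and coverage thresholds) has already been computed, and the protein names are hypothetical.

```python
from collections import Counter


def single_linkage_clusters(proteins, linked_pairs):
    """Group proteins so that any two joined by a chain of links share a cluster."""
    parent = {p: p for p in proteins}

    def find(p):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())


def cluster_size_histogram(clusters):
    """Number of clusters of each size; size 1 corresponds to unique proteins."""
    return Counter(len(members) for members in clusters)


# Hypothetical proteins and links, for illustration only.
PROTEINS = ["p1", "p2", "p3", "p4", "p5", "p6"]
LINKS = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
```

A histogram of this kind has the same shape as Table 3: the count at size 1 is the number of unique proteins, and the remaining entries are the multi-member cluster groups.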
represented in S. pombe are those involved in metabolic (12 genes), 
neurological (13 genes), cardiac (1 gene) and renal (1 gene) disease 
(Table 5). 
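
The gene counts reported for the human disease comparison in this section are mutually consistent, which can be confirmed with a short tally. The sketch below uses only figures stated in the text (172 similar proteins, 122 with weaker similarity, 23 cancer-related genes, and the four categories listed above); the variable names are illustrative.

```python
# Figures quoted in the human disease gene analysis.
TOTAL_SIMILAR = 172      # S. pombe proteins similar to human disease proteins
WEAK_SIMILARITY = 122    # those with the less significant E-values

# Genes per disease category among the more significant matches.
CATEGORY_COUNTS = {
    "cancer": 23,
    "metabolic": 12,
    "neurological": 13,
    "cardiac": 1,
    "renal": 1,
}


def strong_matches(total: int, weak: int) -> int:
    """Proteins left once the weaker matches are set aside."""
    return total - weak


if __name__ == "__main__":
    print("strong matches:", strong_matches(TOTAL_SIMILAR, WEAK_SIMILARITY))
    print("sum over categories:", sum(CATEGORY_COUNTS.values()))
```

Both numbers come to 50, the size of the set listed in Tables 4 and 5.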

A similar analysis in S. cerevisiae identified 182 proteins with 
similarities to the human disease set, with most of the genes coding 
for these proteins being shared by the two yeasts. Only two of the 
genes (SPAC630.13c and SPBC530.12c), found in S. pombe but not 
S. cerevisiae, code for proteins with any significant similarity to 
human disease proteins. These are tuberous sclerosis 2 (TSC2), 
involved in cancer, and ceroid lipofuscinosis PPT1, involved in 
metabolism. Both yeasts seem to be similarly useful as model 
organisms for the study of human disease gene function, although 
their differing biologies may favour one organism for certain genes 
and the other organism for other genes. 

Protein domains 

Listed in Table 6 are the ten most frequent protein domains found in 
S. pombe, with 11 more domains of interest in the top 40 most 
frequent, as determined by InterPro matches, together with the 
frequency of these domains for the other fully sequenced eukaryotic 
genomes. These domains are divided into three categories (1-3). 

The first category (1) consists of five domains found in the top ten 
most frequent domains in S. pombe that are also found in the top ten 
of at least four of the other eukaryotes. They are the ATP/GTP 
binding site, the WD40 repeat, the eukaryotic protein kinase 
catalytic core, the RNA binding region RNP-1, and the zinc finger 
C2H2-type transcriptional activator. These universal and com- 
monly exploited domains also feature highly in other eukaryotes. 
Because total gene number increases with the complexity of an 
organism, the proportion of these domains is approximately similar 
in each of the sequenced eukaryotic genomes. Energy utilization 
exploiting the ATP/GTP binding site, protein phosphorylation 
dependent on the catalytic protein kinase domain, and transcrip- 
tional activation using the zinc finger C2H2 domain must define 
biochemical mechanisms that are readily exploited to generate new 
biological pathways. 

In the second category (2), the domains are present in a similar 
absolute number in the eukaryotic genomes analysed. Amongst 
those more frequently found in this category are the BRCT, 
replication factor C, minichromosome maintenance proteins 
(MCMs), Fizzy, DNA-directed DNA polymerase B family and 
helicase C-terminal domains. Some of these are involved in core 
cell activities like DNA replication, DNA repair and cell-cycle 
progression, perhaps explaining why they are present in similar 



Table 4 Schizosaccharomyces pombe genes related to human cancer genes 

Human cancer gene                                          Score*  S. pombe gene/product       Systematic name
Xeroderma pigmentosum D; XPD                               ++++    rad15, rhp3                 SPAC1D4.12
Xeroderma pigmentosum B; ERCC3                             ++++    rad25                       SPAC17A5.06
Hereditary non-polyposis colorectal cancer (HNPCC); MSH2   ++++    msh2                        SPBC24C6.12c
Xeroderma pigmentosum F; XPF                               ++++    rad16, rad10, rad20, swi9   SPCC970.01
Immunodeficiency; DNA ligase 1                             ++++    cdc17                       SPAC57A10.13c
HNPCC; PMS2                                                ++++    pms1                        SPAC19G12.02c
HNPCC; MSH6                                                ++++    msh6                        SPCC285.16c
HNPCC; MSH3                                                ++++    swi4                        SPAC8F11.03
HNPCC; MLH1                                                ++++    mlh1                        SPBC1703.04
Haematological Chediak-Higashi syndrome; CHS1              ++++                                SPBC28E12.06c
Darier-White disease; SERCA                                ++++    pgak                        SPBC31E1.02c
Bloom syndrome; BLM                                        ++++    hus2, rqh1, rad12           SPAC2G11.12
Ataxia telangiectasia; ATM                                 ++++    tel1                        SPCC23B6.03c
Xeroderma pigmentosum G; XPG                               +++     rad13                       SPBC3E7.08c
Tuberous sclerosis 2; TSC2                                 +++                                 SPAC630.13c
Immune bare lymphocyte; ABCB3                              +++                                 SPBC9B6.09c
Downregulated in adenoma; DRA                              +++                                 SPAC869.05c
Diamond-Blackfan anaemia; RPS19                            +++     rps19                       SPBC649.02
Cockayne syndrome 1; CKN1                                  +++                                 SPBC577.09
RAS                                                        +++     ste5, ras1                  SPAC17H9.09c
Cyclin-dependent kinase 4; CDK4                            +++     cdc2                        SPBC11B10.09
CHK2 protein kinase                                        +++     cds1                        SPCC18B5.11c
AKT2                                                       +++     pck2, sts6, pkc1            SPBC12D12.04c

* Scores are: ++++, <1 x 10^-100; +++, 1 x 10^-40 to 1 x 10^-100. 



Table 3 Gene duplication in S. pombe and S. cerevisiae 

Protein members         No. of clusters  No. of clusters
per cluster             in S. pombe      in S. cerevisiae
1                       4,515            5,061
2                       124              256
3                       17               28
4                       8                11
5                       2                3
6                       1                1
7                       2                1
>7                      0                3
Total no. of clusters   4,669            5,364
Total no. of sequences  4,876            5,777

Protein clusters were identified with NCBI BlastClust using parameters S 1.0, L 0.9, as recommended 
by Y. Wolf (personal communication). We used databases of 4,876 S. pombe proteins prepared just 
before completion of the genome sequence and of 5,777 S. cerevisiae proteins. 



Human disease genes 

To assess the usefulness of S. pombe for investigating the functions of 
genes related to human disease, we used the same method and 
dataset of human disease genes as that employed for analysis of the 
Drosophila genome. Protein-coding genes of S. pombe were 
identified that generate products with similarities to proteins coded by 
289 genes that are mutated, amplified or deleted in human disease. 
A total of 172 S. pombe proteins have similarity with members of 
this data set of human disease proteins, and 122 of these have 
E-values greater than 1 x 10^-40. These values indicate that either they 
are not significant or they have only limited similarities with the 
equivalent human proteins, reflecting, for example, shared domains 
such as related protein-interacting regions or catalytic sites. 
However, despite this limitation, they may still be useful for investigating 
the biochemical activities and interactions of human disease 
proteins in S. pombe. The other 50 S. pombe proteins (Tables 4 and 5) 
have E-values lower than 1 x 10^-40. The more significant similarities 
seen with this class mean that genes coding for these proteins are 
more likely to be useful for investigating not only the biochemical 
but also the biological functions of the human genes, and some 
could provide good models for studying the associated human 
disease pathways. The largest group of human disease-related genes 
are those implicated in cancer. There are 23 such genes (Table 4), 
and they are involved in DNA damage and repair, checkpoint 
controls, and the cell cycle, all processes involved in maintaining 
genomic stability. The cell cycle and checkpoint background of 
S. pombe make it a good model organism for studying these 
particular cancer disease pathways. Other categories that are also 
Table 5 Schizosaccharomyces pombe genes related to human disease genes 

Human disease gene                                  Disease       Score*  S. pombe gene/product            Systematic name
Wilson disease; ATP7B                               Metabolic     ++++    P-type copper ATPase             SPBC29A3.01
                                                    Metabolic             krp1, kinesin related            SPAC22E12.09c
Hyperinsulinism; ABCC8                              Metabolic     ++++    ABC transporter                  SPAC3F10.11c
                                                    Metabolic             zwf1 G6P dehydrogenase           SPAC3A12.18
                                                    Metabolic             Argininosuccinate synthase       SPBC428.05c
                                                    Metabolic             Transketolase                    SPBC2G5.05
Variegate porphyria; PPOX                           Metabolic     +++     Protoporphyrinogen oxidase       SPAC1F5.07c
Maturity-onset diabetes of the young (MODY2); GCK   Metabolic     +++     hxk1, hexokinase                 SPAC24H6.04
Gitelman's syndrome; SLC12A3                        Metabolic     +++     CCC Na-K-Cl transporter          SPBC18H10.16
Cystinuria type 1; SLC3A1                           Metabolic     +++     α-glucosidase                    SPBC1683.07
Cystic fibrosis; ABCC7                              Metabolic     +++     ABC transporter                  SPBC359.05
Bartter's syndrome; SLC12A1                         Metabolic     +++     CCC Na-K-Cl transporter          SPBC18H10.16
Menkes syndrome; ATP7A                              Neurological  ++++    P-type copper ATPase             SPBC29A3.01
[illegible]                                         Neurological  ++++    myo51 class V myosin             SPBC2D10.14c
Zellweger syndrome; PEX1                            Neurological  +++     AAA-family ATPase                SPCC553.03
Thomsen disease; CLCN1                              Neurological  +++     ClC chloride channel             SPBC19C7.11
Spinocerebellar ataxia type 6 (SCA6); CACNA1A       Neurological  +++     VIC sodium channel               SPAC6F6.01
Myotonic dystrophy; DM1                             Neurological  +++     orb6 Ser/Thr protein kinase      SPAC821.12
McCune-Albright syndrome; GNAS1                     Neurological  +++     gpa1 guanine nucleotide binding  SPBC24C6.06
Lowe's oculocerebrorenal syndrome; OCRL             Neurological  +++     PIP phosphatase                  SPBC2G2.02
Dent's; CLCN5                                       Neurological  +++     ClC chloride channel             SPBC19C7.11
Coffin-Lowry; RPS6KA3                               Neurological  +++     Ser/Thr protein kinase           SPCC24B10.07
Angelman; UBE3A                                     Neurological  +++     Ubiquitin-protein ligase         SPBP8B7.27
Amyotrophic lateral sclerosis; SOD1                 Neurological  +++     sod1, superoxide dismutase       SPAC821.10c
Oguchi type 2; RHKIN                                Neurological  +++     Ser/Thr protein kinase           SPCC24B10.07
Familial cardiac myopathy; MYH7                     Cardiac       ++++    myo2, myosin II                  SPCC645.05c
Renal tubular acidosis; ATP6B1                      Renal         ++++    V-type ATPase                    SPAC637.05c

* Scores are: ++++, <1 x 10^-100; +++, 1 x 10^-40 to 1 x 10^-100. 



absolute number regardless of genome size. Systematic searches 
for other domains present in similar absolute numbers in genomes 
of all eukaryotes might identify other, at present unrecognized, 
functions involved in similar core cell activities. 

The third category (3) includes domains whose occurrence rises 
dramatically with increasing genome size within the Metazoa. This 
category includes the SH3, PH and tyrosine/dual-specificity 
phosphatase domains. These are involved in intra- and intercellular 
signalling pathways, which might be expected to become 
increasingly elaborate as multicellular complexity increases. 

Two other domains in the top ten for both the yeasts are the sugar 
and ABC transporters (Table 6). S. cerevisiae has significantly more 
of these domains and the amino-acid permease domain than does 
S. pombe, which may explain why it is a more versatile organism, 
growing on a greater range of media. The Zn(II)Cys(6) 
transcription-factor domain is found only in the two yeasts, supporting the idea 
that it is specific to fungi. The chromodomain is found more 
frequently in S. pombe (seven examples compared with two in 
S. cerevisiae), possibly reflecting differences in higher-order 
chromatin structure. 

Defining the eukaryotic cell 

The genome sequence of S. pombe increases the range of 
available complete eukaryotic genome sequences to two 
unicellular free-living organisms (S. cerevisiae and S. pombe), one 
plant (Arabidopsis), and three metazoans (C. elegans, Drosophila 
and humans). This range of organisms allows a comparison 
between eukaryotic and prokaryotic genomes (represented by 37 
bacteria and 8 archaea), with the intention of identifying those 
genes important for eukaryotic cell organization. We have made an 



Table 6 Protein domain analysis and comparison with other eukaryotes 

Values for each organism are given as number of proteins (rank). 

InterPro   S. pombe         S. cerevisiae    H. sapiens       D. melanogaster  C. elegans       A. thaliana      InterPro name                                   Category
IPR001687  213 (1)          267 (1)          436 (5)          231 (4)          191 (7)          331 (5)          ATP/GTP-binding site motif A (P-loop)           1
IPR001680  114 (2)          97 (3)           277 (8)          183 (5)          102 (19)         210 (10)         G-protein β WD40 repeats                        1
IPR000719  111 (3)          119 (2)          579 (3)          377 (2)          450 (2)          1,049 (1)        Eukaryotic protein kinase                       1
IPR000504  80 (4)           61 (5)           307 (7)          182 (6)          97 (21)          255 (8)          RNA binding region RNP-1                        1
IPR001650  67 (5)           63 (4)           155 (20)         101 (17)         80 (27)          148 (13)         Helicase C-terminal domain                      2
IPR001841  44 (6)           33 (12)          215 (15)         120 (11)         126 (12)         379 (4)          RING finger
IPR001440  38 (7)           33 (12)          150 (21)         92 (18)          46 (43)          125 (17)         TPR repeat
IPR001066  36 (8)           46 (8)           44 (64)          45 (34)          55 (37)          98 (26)          Sugar transporter
IPR001617  33 (9)           42 (9)           75 (40)          67 (28)          61 (36)          103 (25)         ABC transporter family
IPR000822  32 (10)          51 (7)           712 (2)          403 (1)          154 (10)         115 (20)         Zinc finger, C2H2 type                          1
IPR001357  14 (23)          10 (30)          24 (82)          17 (61)          25 (60)          17 (83)          BRCT domain                                     2
IPR000862  8 (29)           9 (31)           8 (98)           9 (68)           6 (79)           13 (87)          Replication factor C conserved domain           2
IPR002064  5 (32)           5 (35)           4 (102)          6 (70)           3 (82)           5 (95)           DNA-directed DNA polymerase family B            2
IPR001208  6 (31)           6 (34)           12 (94)          13 (64)          5 (80)           8 (92)           MCM family                                      2
IPR000002  5 (32)           3 (37)           3 (103)          4 (72)           2 (83)           6 (94)           FIZZY/CDC20 domain                              2
IPR001452  21 (16)          23 (18)          220 (14)         82 (23)          62 (35)          3 (97)           Src homology 3 (SH3) domain                     3
IPR001849  21 (16)          26 (16)          253 (11)         89 (22)          75 (31)          27 (73)          PH domain                                       3
IPR000387  9 (28)           11 (29)          112 (29)         47 (40)          110 (16)         21 (79)          Tyrosine-specific protein phosphatase and       3
                                                                                                                 dual-specificity protein phosphatase family
IPR001138  27 (13)          52 (6)           0 (NA)           0 (NA)           0 (NA)           0 (NA)           Fungal transcriptional regulatory protein
IPR002293  21 (16)          32 (13)          43 (65)          36 (45)          32 (54)          65 (42)          Permease for amino acids and related compounds
IPR000953  7 (30)           2 (38)           26 (80)          20 (58)          15 (70)          24 (76)          Chromodomain

Domain identifiers are from InterPro, which integrates PROSITE, PRINTS and PFAM. Only domains within the most frequent 40 found in S. pombe are given. The numbers of proteins with these domains and 
their ranking is given for S. pombe and the other eukaryotes listed. At the right end of the table is a classification of 1-3; see text for an explanation. NA, not applicable. 
Table 7 Identifying conserved genes important for defining the eukaryotic cell and multicellularity 

Similarity  No. of genes  ≤20%  ≤15%  ≤12%

(a) Genes defining eukaryotic organization
50%         184           62    47    41
45%         245           86    63    55
40%         311           113   81    70

(b) Genes defining multicellularity
50%         397           1     1     1
45%         511           2     1     1
40%         647           3     2     2

The same data set for assessing gene duplication was used. Protein data sets were identified with 
40%, 45% and 50% similarity for humans, Drosophila, C. elegans, S. pombe and S. cerevisiae in a 
or for human, Drosophila, C. elegans and Arabidopsis in b. The Blast-calculated bit score describes 
the similarity between two sequences. For two identical sequences (a compared to a) the bit score is 
100%. For different sequences (a compared with b) the measure of similarity is bit score (ab)/bit 
score (aa) x 100. The numbers within these data sets are not found in any of the fully sequenced 
prokaryotes (45 in total) in a, or any of the prokaryotes and the two yeasts in b at similarity levels 
12%, 15% and 20%. The 45 prokaryotes include genomes from 37 Eubacteria and 8 Archaea. 



initial analysis to identify the more conserved genes falling in this 
category by comparing the predicted protein sequences coded by 
the above genomes. The percentage similarity was derived from the 
hit bit score divided by the self bit score for each protein (see Table 7 
legend). We selected those proteins with a high percentage similarity 
score in all of the eukaryotes, and a low one in all of the prokaryotes. 
Three thresholds (50%, 45% and 40%) were used to identify 
proteins that are highly conserved in the fully sequenced eukaryotes 
and three corresponding thresholds (20%, 15% and 12% respec- 
tively) to identify proteins not found in the fully sequenced 
prokaryotes (Table 7a). For an initial discussion of these proteins, 
thresholds of 50% and 20% were selected. This analysis identifies 
genes coding for proteins that are highly conserved in yeasts, plants 
and metazoans (by using a threshold of 50% similarity) and yet are 
not well conserved in prokaryotes (by using a threshold of 20% 
similarity). The proteins identified using these criteria are likely to 
be important for maintaining eukaryotic cell organization, 
although the high threshold of 50% means that other proteins 
required for this may well be excluded. 
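
The selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the dictionary layout, the toy bit scores, and the names `enoX` and `orphan` are all hypothetical, and treating the thresholds as inclusive (at least 50% in every eukaryote, at most 20% in every prokaryote) is an assumption. A missing hit is counted as a bit score of zero.

```python
from typing import Dict, Iterable, List

# Hypothetical layout: protein -> {organism: best-hit bit score}
BitScores = Dict[str, Dict[str, float]]


def percent_similarity(hit_bits: float, self_bits: float) -> float:
    """Best-hit bit score as a percentage of the query's self bit score."""
    if self_bits <= 0:
        raise ValueError("self bit score must be positive")
    return 100.0 * hit_bits / self_bits


def select_conserved(
    self_bits: Dict[str, float],
    euk_hits: BitScores,
    prok_hits: BitScores,
    eukaryotes: Iterable[str],
    euk_min: float = 50.0,   # assumed inclusive lower bound in eukaryotes
    prok_max: float = 20.0,  # assumed inclusive upper bound in prokaryotes
) -> List[str]:
    """Return proteins similar in every listed eukaryote and in no prokaryote."""
    eukaryotes = list(eukaryotes)
    selected = []
    for protein, self_score in self_bits.items():
        euk = euk_hits.get(protein, {})
        prok = prok_hits.get(protein, {})
        # A eukaryote with no recorded hit counts as 0% similarity.
        conserved = all(
            percent_similarity(euk.get(org, 0.0), self_score) >= euk_min
            for org in eukaryotes
        )
        # Every recorded prokaryotic hit must stay at or below the cut-off.
        absent = all(
            percent_similarity(bits, self_score) <= prok_max
            for bits in prok.values()
        )
        if conserved and absent:
            selected.append(protein)
    return sorted(selected)


# Toy data (invented numbers, for illustration only).
self_bits = {"act1": 780.0, "enoX": 900.0, "orphan": 400.0}
euk_hits = {
    "act1": {"S. cerevisiae": 700.0, "human": 690.0},
    "enoX": {"S. cerevisiae": 800.0, "human": 750.0},
    "orphan": {"S. cerevisiae": 100.0},
}
prok_hits = {"act1": {"E. coli": 60.0}, "enoX": {"E. coli": 600.0}}

selected = select_conserved(
    self_bits, euk_hits, prok_hits, eukaryotes=["S. cerevisiae", "human"]
)
print(selected)
```

Rerunning the same function with `euk_min=45.0, prok_max=15.0` or `euk_min=40.0, prok_max=12.0` would correspond to the other threshold pairs mentioned in the text.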

Using these thresholds, 62 genes were identified and grouped 
according to function (Table 8). More information about these 
genes can be found on the GeneDB website (http://www.genedb.org/pombe)
and the PombePD website (http://proteome.com/databases). Two of these
groups code for proteins associated with characteristics considered to
distinguish eukaryotic cells from prokaryotic cells: the organization of
DNA in chromosomes within a nucleus, and the formation of 40S and 60S
ribosomal subunits, which are larger than the prokaryotic 30S and 50S
subunits. The first group includes the H3 and H4 core histone
proteins required for packaging DNA into nucleosomes, the Hda1
histone deacetylase, which suggests histone acetylation is critical
for eukaryotic chromatin, and the Ran GTPase Spi1, a key element
for nuclear membrane transport. One putative protein in this
category (SPAC890.07c) is possibly involved in export of mRNA
binding proteins and another may be localized in the nucleus
(SPCP1E11.08). The second group includes two Rps and six Rpl
proteins, components of the 40S and 60S ribosomal subunits
respectively; these eight proteins may contribute to differences in
protein translation between prokaryotes and eukaryotes.

Two further groups in Table 8 are relevant for the more elaborate 
organization and compartmentation of eukaryotic cells. One consists
of cytoskeletal proteins, the actins Act1 and Act2, the tubulins
Nda2, Nda3 and Tub1, and the cytoskeleton-associated proteins
Arp2 and Cdc42. The actin and tubulin polymers provide not only
internal structure but also the means for transport of components
and information from one region of the cell to another, important
matters given the increased size of eukaryotic cells. The bacterial
FtsA, Hsp70 and FtsZ proteins have structures with similarities
respectively to actin and tubulin but only very limited primary
sequence similarities. Arp2 is an actin-related protein required
for actin organization, and the Cdc42 GTPase is a signalling
molecule important for cell shape and for communicating signals
from the cytoskeleton. One protein (SPAC926.07c) is predicted to
be a dynein light chain. The second group consists of GTP binding
proteins and their regulators Ypt1, -2, -3 and -7, Arf1, Aps1, Gdi1
and Sar1, which are required for membrane transport.
Membrane-bound organelles and structures are characteristic features of
eukaryotic cells, and membrane fusion and fragmentation are
important in organelle formation and function. Cam1 (calmodulin)
is a protein that exploits compartmentalization of Ca2+ to
regulate cellular processes. One protein (SPBC1539.08) is a putative
ADP ribosylation factor and may be involved in transport. 

A small group (Table 8) includes cell-cycle and checkpoint 
control proteins. The Cdc2 protein kinase (Cdc28 in S. cerevisiae) 
is a cyclin-dependent kinase (CDK) controlling the onset of S-phase 
and mitosis in the two yeasts, with closely related CDKs controlling 
these cell-cycle transitions in other eukaryotes. The CDK system for 
cell-cycle control evolved with the appearance of eukaryotic cells, 
whose cell cycle differs from prokaryotes in two ways: DNA 
synthesis, which uses multiple origins of replication, and mitosis, 
which brings about chromosome segregation. It has been argued 
that, in the primeval eukaryote, there was a single CDK that 
underwent a monotonic change during the cell cycle, initiating S 
phase early in the cycle at a low activity and mitosis late in the cycle 
at a high activity. Two checkpoint proteins, Rad24 and Rad25, are
14-3-3 proteins thought to regulate the Cdc25 phosphatase controlling
the Cdc2 CDK. If DNA becomes damaged then these
checkpoint proteins prevent the onset of mitosis until the damage is 
repaired. This pathway is essential for maintaining genomic stability 
and seems to be characteristic of eukaryotic cells. 

Three further groups reflect biochemical processes that are 
important in eukaryotic cell regulation. The first group consists of 
Lsm2 and Smd2, which are required for RNA splicing. The 
second group consists of the Ubc, Ubi and Ubl proteins together 
with Uep1 and Pad1 (Table 8), all required to bring about controlled
proteolysis of proteins. A further protein putatively involved in
proteolysis is a prohibitin complex subunit (SPAC1782.06c). The



Table 8 Classification of conserved genes important for defining the eukaryotic cell

Nucleus      Ribosomal  Cytoskeleton  Compartmentation  Cell cycle  Splicing  Proteolysis   Kinase/phosphatase
h3.1         rpl18      act1          ypt1              cdc2        lsm2      ubc13         cka1
h3.2         rpl27      act2          ypt2              rad24       smd2      ubc4          dis2
h3.3         rpl27A     arp2          ypt3              rad25                 ubi1          hhp1
h4.1         rpl29      cdc42         ypt7                                    ubi4          ppa1
h4.2         rpl7A      nda2          aps1                                    ubl1          ppa2
h4.3         rpl7       nda3          arf1                                    uep1          ppe1
hda1         rps3A      tub1          cam1                                    hus5          sds21
spi1         rps21      SPAC926.07c   gdi1                                    pad1          SPBC26H8.05c
SPAC890.07c                           sar1                                    rhp6          SPAC22H10.04
SPCP1E11.08                           SPBC1539.08                             SPAC1782.06c

Miscellaneous: SPBC24C6.11, SPBP8B7.24c

The 62 proteins from Table 7a (50% versus 20%) are classified according to their primary function as described in the text. For putative functions, only the gene location is given.



878    © 2002 Macmillan Magazines Ltd    NATURE | VOL 415 | 21 FEBRUARY 2002 | www.nature.com



third group consists of protein kinases and phosphatases, and 
includes Cka1, Dis2, Hhp1, Ppa1, Ppa2, Ppe1 and Sds21 and
putative serine/threonine protein phosphatases (SPAC22H10.04
and SPBC26H8.05c). The presence of these three regulatory pro- 
cesses unique to eukaryotic cells allows protein levels and activities 
to be specifically and rapidly changed without relying on changes in 
transcription rate. In prokaryotic cells, gene regulation often oper- 
ates through changes in transcription rate, followed by dilution of 
remaining proteins as a consequence of rapid cellular growth. The 
slower growth rates of eukaryotic cells mean that mechanisms in
addition to dilution by growth are required to modulate protein 
activity; these mechanisms may be provided by RNA splicing, 
proteolysis and phosphorylation. 

Two genes code for a putative zinc-finger protein (SPBC24C6.11)
with a possible role in cell polarity and a putative autophagy protein 
(SPBP8B7.24c) that may mediate attachment of autophagosomes to 
microtubules. Extension of this analysis at different thresholds of 
similarity should identify further proteins of unknown function 
that are important for eukaryotic cell organization. 

We performed a similar analysis to identify highly conserved 
genes that may be important for maintaining multicellular eukaryotic
organization (Table 7b). We compared the proteins in prokaryotes
and in S. cerevisiae and S. pombe, which are all unicellular, with
those of C. elegans, Drosophila, Arabidopsis and humans, which are 
all multicellular. The same thresholds were used to identify those 
proteins that are highly conserved in the four multicellular eukary- 
otes (50%, 45% and 40%) and to identify which of these proteins 
were not found to be highly conserved in the unicellular organisms 
(20%, 15% and 12%). The number of genes coding for proteins that 
fall into these categories was very small: one to three depending on 
the thresholds used. These genes code for a putative transcription
factor, an RNA-binding protein and a selenium-binding protein. 

As more sequences become available, the groups of genes we have 
identified as being important for eukaryotic and multicellular 
organization will inevitably be modified. However, our results 
allow us to speculate on the evolutionary transitions from prokaryotes
to eukaryotes and to multicellularity. The transition to
multicellularity may not have required the evolution of many new genes,
absent from unicellular organisms. The pathways necessary for
multicellular organization could already have been in existence in
unicellular eukaryotes. For example, intercellular signalling may
have been solved by the sexual needs of primeval, single-celled
eukaryotes to seek out and identify an appropriate mating partner.
Once signalling between cells had evolved, it could be readily
exploited to generate the signalling pathways required for
multicellular organization. The highly conserved genes specific to
eukaryotes may be necessary for eukaryotic cell organization to be
generated. In contrast, the transition from unicellularity to
multicellularity may not have required many new genes. Instead it may
have used genes already present in unicellular eukaryotes, perhaps
by the shuffling of functional domains, to give rise to new
combinations, which allowed the development of pathways required for the
evolution of multicellularity. If these speculations are correct,
they imply that the evolutionary transition from unicellular pro- 
karyotic to unicellular eukaryotic life may have been more complex 
than the transition to multicellular life. This might provide some 
explanation as to why it took around 2,300 million years (Myr) to 
evolve from the first prokaryote to the first eukaryote (thought to 
have arisen about 3,800 Myr and 1,500 Myr ago, respectively) but 
only 500 Myr for the evolution of the first multicellular organisms, 
which arose about 1,000 Myr ago. Further analyses and comparisons
should continue to be illuminating about this interesting question 
of which genes define eukaryotic cells and which define multi- 
cellular organisms. □ 
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In this Article, the author Andreas Düsterhöft was mistakenly
omitted: his name and affiliation (footnote 6) should have been
inserted between M. Fuchs and C. Fritz in the author list. In
addition, the name of L. Cerutti (in the last line of the author list)
was misspelled. On p874 in the penultimate sentence of the
'Intergene regions' section, "tandemly oriented genes" should read
"divergently oriented genes". □



Probing the free-energy surface for 
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molecule fluorescence
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The upper limit on the polypeptide reconfiguration time ($\tau_0$) was
inadvertently calculated using $(\sigma_{\mathrm{app}} - \sigma_0)^2$ instead of $(\sigma_{\mathrm{app}}^2 - \sigma_0^2)$,
as given in the formula in the text (page 745, right column). The
correct upper limit is therefore 0.2 ms. This results in a lower limit 
on the free energy barrier (A) of Ik^Z corresponding to an 
activation entropy of $+3k_B$ (page 746, right column), and an
upper limit on the pre-exponential factor ($2\pi\tau_0$) of 1 ms. This
mistake does not affect any of the conclusions. We thank Taekjip Ha 
for bringing it to our attention. □ 
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